NSF PAR Search | NSF Public Access Repository

Decoding biology with massively parallel reporter assays and machine learning

https://doi.org/10.1101/gad.351800.124

La Fleur, Alyssa; Shi, Yongsheng; Seelig, Georg (September 2024, Genes & Development)

Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Full Text Available
RBBP6 anchors pre-mRNA 3′ end processing to nuclear speckles for efficient gene expression

https://doi.org/10.1016/j.molcel.2024.12.016

Yoon, Yoseop; Bournique, Elodie; Soles, Lindsey V; Yin, Hong; Chu, Hsu-Feng; Yin, Christopher; Zhuang, Yinyin; Liu, Xiangyang; Liu, Liang; Jeong, Joshua; et al (February 2025, Molecular Cell)

Free, publicly-accessible full text available February 1, 2026
Optimizing 5’UTRs for mRNA-delivered gene editing using deep learning

https://doi.org/10.1038/s41467-024-49508-2

Castillo-Hair, Sebastian; Fedak, Stephen; Wang, Ban; Linder, Johannes; Havens, Kyle; Certo, Michael; Seelig, Georg (June 2024, Nature Communications)

mRNA therapeutics are revolutionizing the pharmaceutical industry, but methods to optimize the primary sequence for increased expression are still lacking. Here, we design 5’UTRs for efficient mRNA translation using deep learning. We perform polysome profiling of fully or partially randomized 5’UTR libraries in three cell types and find that UTR performance is highly correlated across cell types. We train models on our datasets and use them to guide the design of high-performing 5’UTRs using gradient descent and generative neural networks. We experimentally test designed 5’UTRs with mRNA encoding megaTAL™ gene editing enzymes for two different gene targets and in two different cell lines. We find that the designed 5’UTRs support strong gene editing activity. Editing efficiency is correlated between cell types and gene targets, although the best performing UTR was specific to one cargo and cell type. Our results highlight the potential of model-based sequence design for mRNA therapeutics.
A nanopore interface for higher bandwidth DNA computing

https://doi.org/10.1038/s41467-022-32526-3

Zhang, Karen; Chen, Yuan-Jyue; Wilde, Delaney; Doroschak, Kathryn; Strauss, Karin; Ceze, Luis; Seelig, Georg; Nivala, Jeff (December 2022, Nature Communications)

DNA has emerged as a powerful substrate for programming information processing machines at the nanoscale. Among the DNA computing primitives used today, DNA strand displacement (DSD) is arguably the most popular, with DSD-based circuit applications ranging from disease diagnostics to molecular artificial neural networks. The outputs of DSD circuits are generally read using fluorescence spectroscopy. However, due to the spectral overlap of typical small-molecule fluorescent reporters, the number of unique outputs that can be detected in parallel is limited, requiring complex optical setups or spatial isolation of reactions to make output bandwidths scalable. Here, we present a multiplexable sequencing-free readout method that enables real-time, kinetic measurement of DSD circuit activity through highly parallel, direct detection of barcoded output strands using nanopore sensor array technology (Oxford Nanopore Technologies’ MinION device). These results increase DSD output bandwidth by an order of magnitude over what is currently feasible with fluorescence spectroscopy.
Full Text Available
Deciphering the impact of genetic variation on human polyadenylation using APARENT2

https://doi.org/10.1186/s13059-022-02799-4

Linder, Johannes; Koplik, Samantha E.; Kundaje, Anshul; Seelig, Georg (November 2022, Genome Biology)

Background: 3′-end processing by cleavage and polyadenylation is an important and finely tuned regulatory process during mRNA maturation. Numerous genetic variants are known to cause or contribute to human disorders by disrupting the cis-regulatory code of polyadenylation signals. Yet, due to the complexity of this code, variant interpretation remains challenging.
Results: We introduce a residual neural network model, APARENT2, that can infer 3′-cleavage and polyadenylation from DNA sequence more accurately than any previous model. This model generalizes to the case of alternative polyadenylation (APA) for a variable number of polyadenylation signals. We demonstrate APARENT2’s performance on several variant datasets, including functional reporter data and human 3′ aQTLs from GTEx. We apply neural network interpretation methods to gain insights into disrupted or protective higher-order features of polyadenylation. We fine-tune APARENT2 on human tissue-resolved transcriptomic data to elucidate tissue-specific variant effects. By combining APARENT2 with models of mRNA stability, we extend aQTL effect size predictions to the entire 3′ untranslated region. Finally, we perform in silico saturation mutagenesis of all human polyadenylation signals and compare the predicted effects of >43 million variants against gnomAD. While loss-of-function variants were generally selected against, we also find specific clinical conditions linked to gain-of-function mutations. For example, we detect an association between gain-of-function mutations in the 3′-end and autism spectrum disorder. To experimentally validate APARENT2’s predictions, we assayed clinically relevant variants in multiple cell lines, including microglia-derived cells.
Conclusions: A sequence-to-function model based on deep residual learning enables accurate functional interpretation of genetic variants in polyadenylation signals and, when coupled with large human variation databases, elucidates the link between functional 3′-end mutations and human health.
CellMeSH: probabilistic cell-type identification using indexed literature

https://doi.org/10.1093/bioinformatics/btab834

Mao, Shunfu; Zhang, Yue; Seelig, Georg; Kannan, Sreeram (February 2022, Bioinformatics)
Birol, Inanc (Ed.)
Motivation: Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge.
Results: Here, we introduce CellMeSH, a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene–cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene–cell-type associations extracted from the literature are noisy. CellMeSH is also able to optionally utilize prior knowledge about tissues or cells for further annotation improvement. CellMeSH achieves top-one and top-three accuracies on a number of mouse and human datasets that are consistently better than existing approaches.
Availability and implementation: Web server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh.
Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
Interpreting neural networks for biological sequences by learning stochastic masks

https://doi.org/10.1038/s42256-021-00428-6

Linder, Johannes; La Fleur, Alyssa; Chen, Zibo; Ljubetič, Ajasja; Baker, David; Kannan, Sreeram; Seelig, Georg (January 2022, Nature Machine Intelligence)

Full Text Available
Combined Amplification and Molecular Classification for Gene Expression Diagnostics

https://doi.org/10.1007/978-3-030-26807-7_9

Gowri, Gokul; Lopez, Randolph; Seelig, Georg (July 2019, DNA Computing and Molecular Programming. DNA 2019. Lecture Notes in Computer Science)
Thachuk, Chris; Liu, Yan (Ed.)
RNA expression profiles contain information about the state of a cell and specific gene expression changes are often associated with disease. Classification of blood or similar samples based on RNA expression can thus be a powerful method for disease diagnosis. However, basing diagnostic decisions on RNA expression remains impractical for most clinical applications because it requires costly and slow gene expression profiling based on microarrays or next generation sequencing followed by often complex in silico analysis. DNA-based molecular classifiers that perform a computation over RNA inputs and summarize a diagnostic result in situ have been developed to address this issue, but lack the sensitivity required for use with actual biological samples. To address this limitation, we here propose a DNA-based classification system that takes advantage of PCR-based amplification for increased sensitivity. In our initial scheme, the importance of a transcript for a diagnostic decision is proportional to the number of molecular probes bound to that transcript. Although probe concentration is similar to that of the RNA input, subsequent amplification of the probes with PCR can dramatically increase the sensitivity of the assay. However, even slight biases in PCR efficiency can distort weight information encoded by the original probe set. To address this concern, we developed and mathematically analyzed multiple strategies for mitigating the bias associated with PCR-based amplification. We evaluate these amplified molecular classification strategies through simulation using two distinct gene expression data sets and associated disease categories as inputs. Through this analysis, we arrive at a novel molecular classifier framework that naturally accommodates PCR bias and also uses a smaller number of molecular probes than required in the initial, naive implementation.
Full Text Available
A molecular multi-gene classifier for disease diagnostics

https://doi.org/10.1038/s41557-018-0056-1

Lopez, Randolph; Wang, Ruofan; Seelig, Georg (January 2018, Nature Chemistry)

Full Text Available
Scalable preprocessing for sparse scRNA-seq data exploiting prior knowledge

https://doi.org/10.1093/bioinformatics/bty293

Mukherjee, Sumit; Zhang, Yue; Fan, Joshua; Seelig, Georg; Kannan, Sreeram (June 2018, Bioinformatics)

Motivation: Single cell RNA-seq (scRNA-seq) data contains a wealth of information which has to be inferred computationally from the observed sequencing reads. As the ability to sequence more cells improves rapidly, existing computational tools suffer from three problems. (i) The decreased reads-per-cell implies a highly sparse sample of the true cellular transcriptome. (ii) Many tools simply cannot handle the size of the resulting datasets. (iii) Prior biological knowledge such as bulk RNA-seq information of certain cell types or qualitative marker information is not taken into account. Here we present UNCURL, a preprocessing framework based on non-negative matrix factorization for scRNA-seq data, that is able to handle varying sampling distributions, scales to very large cell numbers and can incorporate prior knowledge.
Results: We find that preprocessing using UNCURL consistently improves performance of commonly used scRNA-seq tools for clustering, visualization and lineage estimation, both in the absence and presence of prior knowledge. Finally we demonstrate that UNCURL is extremely scalable and parallelizable, and runs faster than other methods on a scRNA-seq dataset containing 1.3 million cells.
Availability and implementation: Source code is available at https://github.com/yjzhang/uncurl_python.
Supplementary information: Supplementary data are available at Bioinformatics online.